Supporting FIPO (Future-KL Influenced Policy Optimization)#1801

Open
SeungyounShin wants to merge 9 commits into THUDM:main from SeungyounShin:feature/fipo-loss
Conversation

@SeungyounShin SeungyounShin commented Apr 3, 2026

Summary

Add FIPO (Future-KL Influenced Policy Optimization) as a built-in loss type, enabling dense token-level credit assignment for RL training without a value network.

Ma et al., "FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization" (arXiv:2603.19835)

Method

FIPO extends standard clipped PPO by re-weighting per-token advantages with Future-KL influence weights. While GRPO broadcasts a uniform sequence-level advantage to all tokens, FIPO modulates each token's contribution based on how the future trajectory evolves:

$$\text{FutureKL}_t = \sum_{k=t}^{T} M_k \cdot \gamma^{k-t} \cdot \Delta \log p_k$$

where $\Delta \log p_k = \log \pi_\theta(y_k \mid y_{<k}) - \log \pi_{\theta_{\text{old}}}(y_k \mid y_{<k})$ and $\gamma = 2^{-1/\tau}$ is an exponential decay factor.
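Because the sum runs over the suffix with a geometric decay, it can be evaluated with a single backward recursion, $F_t = M_t \, \Delta\log p_t + \gamma F_{t+1}$. A minimal sketch (function name and tensor shapes are illustrative, not the PR's actual code):

```python
import torch

def future_kl(delta_logp: torch.Tensor, mask: torch.Tensor, tau: float = 32.0) -> torch.Tensor:
    """Discounted suffix sum of the masked log-prob drift.

    delta_logp: [B, T] log pi_theta - log pi_theta_old per token
    mask:       [B, T] 1.0 for response tokens, 0.0 for padding
    """
    gamma = 2.0 ** (-1.0 / tau)  # half-life decay: weight halves every tau tokens
    B, T = delta_logp.shape
    out = torch.zeros_like(delta_logp)
    acc = torch.zeros(B, dtype=delta_logp.dtype, device=delta_logp.device)
    # backward recursion: F_t = M_t * dlp_t + gamma * F_{t+1}
    for t in range(T - 1, -1, -1):
        acc = mask[:, t] * delta_logp[:, t] + gamma * acc
        out[:, t] = acc
    return out
```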

The influence weight is computed as:

$$f_t = \text{clip}\left(\exp(\text{FutureKL}_t),\; 1 - \epsilon_{f},\; 1 + \epsilon_{f}\right)$$

The final FIPO loss modifies the clipped PPO objective with re-weighted advantages:

$$\mathcal{L}_t^{\text{FIPO}} = \min\left(r_t \cdot \hat{A}_t \cdot f_t,\;\; \text{clip}(r_t, 1{-}\epsilon, 1{+}\epsilon) \cdot \hat{A}_t \cdot f_t\right)$$
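Putting the pieces together, one plausible per-token form looks like the following (hypothetical names; the PR's actual `fipo_loss_function()` additionally applies dual-clip and sequence filtering):

```python
import torch

def fipo_surrogate(logp, old_logp, adv, future_kl, eps=0.2, eps_f=0.2):
    """Per-token FIPO surrogate (sketch; caller masks and averages it)."""
    ratio = torch.exp(logp - old_logp)                    # r_t
    f = torch.exp(future_kl).clamp(1 - eps_f, 1 + eps_f)  # influence weight f_t
    f = f.detach()                                        # no grad through Future-KL
    unclipped = ratio * adv * f
    clipped = ratio.clamp(1 - eps, 1 + eps) * adv * f
    return torch.min(unclipped, clipped)                  # maximize; negate for the loss
```

With zero policy drift, `ratio` and `f` are both 1 and the surrogate reduces to the raw advantage, matching standard clipped PPO.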

Key design choices

  • influence_weights are detached from the computation graph — FIPO does not backpropagate through Future-KL
  • Requires multi-step training per rollout (global_batch_size < rollout_batch_size × n_samples_per_prompt) so that $\theta$ drifts from $\theta_{\text{old}}$, making Future-KL non-trivial
  • Uses chunked matrix multiplication for memory-efficient $O(B \cdot L^2)$ Future-KL computation
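The chunked-matmul view: FutureKL is the product of the masked drift with an upper-triangular decay matrix $D_{tk} = \gamma^{k-t}\,[k \ge t]$, materialized one row chunk at a time so peak memory stays at $O(\text{chunk} \cdot L)$ instead of $O(L^2)$. A sketch under those assumptions (not the PR's actual `_compute_future_kl()`):

```python
import torch

def future_kl_chunked(delta_logp, mask, tau=32.0, chunk=128):
    """Future-KL via chunked matmul: only a [chunk, T] slice of the decay
    matrix D[t, k] = gamma^(k-t) * (k >= t) exists at any one time."""
    gamma = 2.0 ** (-1.0 / tau)
    B, T = delta_logp.shape
    x = delta_logp * mask                              # masked drift, [B, T]
    k = torch.arange(T, device=x.device)
    chunks = []
    for start in range(0, T, chunk):
        t = torch.arange(start, min(start + chunk, T), device=x.device)
        exp = (k[None, :] - t[:, None]).to(x.dtype)    # k - t, shape [chunk, T]
        w = (gamma ** exp.clamp(min=0)) * (exp >= 0)   # zero out k < t
        chunks.append(x @ w.T)                         # [B, chunk]
    return torch.cat(chunks, dim=1)
```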

Changes

| File | Description |
| --- | --- |
| `slime/backends/megatron_utils/loss.py` | `_compute_future_kl()` and `fipo_loss_function()` with dual-clip PPO, sequence filtering, and FIPO metrics |
| `slime/utils/arguments.py` | `--loss-type fipo_loss` and 6 FIPO-specific args (`--fipo-decay-rate`, `--fipo-chunk-size`, `--fipo-clip-ratio`, `--fipo-clip-high-only`, `--fipo-safety-thresh`, `--fipo-dual-clip-c`) |
| `scripts/models/qwen3.5-2B.sh` | Qwen3.5-2B model config |
| `examples/fipo/fipo_qwen3.5_2b.sh` | Training example script |

Usage

```bash
# Use with the GRPO advantage estimator
--loss-type fipo_loss \
--advantage-estimator grpo \
--fipo-decay-rate 32.0 \
--fipo-chunk-size 128 \
--fipo-clip-ratio 0.2 \
--fipo-safety-thresh 3.0 \
--fipo-dual-clip-c 10.0 \
--global-batch-size 64   # must be < rollout_batch_size × n_samples_per_prompt
```

FIPO hyperparameter guide

| Parameter | 7B and below | 32B and above | Description |
| --- | --- | --- | --- |
| `--fipo-decay-rate` | 32 | 32 | Half-life $\tau$ for exponential decay |
| `--fipo-clip-ratio` | 0.2 | 0.2 | Influence weight clip range |
| `--fipo-clip-high-only` | | | Clip only the upper bound $[1.0, 1.2]$ for larger models |
| `--fipo-safety-thresh` | 3.0 | 10.0 | Cap high-IS negative samples |

Results

FIPO consistently increases raw reward (Qwen3.5-4B-Base)


A token-level importance-ratio heatmap under FIPO. The strong emphasis on “Alternative” suggests that FIPO reinforces branching points that trigger self-reflection and exploration of alternative solution paths, rather than treating all tokens uniformly.

cc. @zhuzilin

SeungyounShin and others added 4 commits April 3, 2026 01:21
Add FIPO as a built-in loss type for dense token-level credit assignment
without a value network. FIPO re-weights GRPO advantages using discounted
Future-KL divergence, enabling deeper reasoning chains.

Changes:
- loss.py: fipo_loss_function with chunked Future-KL computation
- arguments.py: --loss-type fipo_loss and 6 FIPO-specific args
- scripts/models/qwen3.5-2B.sh: Qwen3.5-2B model config
- examples/fipo/: sbatch training script for Qwen3.5-2B on HPC

Requires multi-step training per rollout (global-batch-size < total
rollout samples) so the policy drifts from the rollout policy.

Ref: Ma et al., arXiv:2603.19835

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Use absolute paths (SLIME_ROOT) instead of relative SCRIPT_DIR
- Add venv activation and flash-attn install on compute node
- Set SGLANG_DISABLE_CUDNN_CHECK=1 for driver compatibility
- Use 8 GPUs per node (H100x8 cluster)
- Add working-dir for ray job submit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add HF→torch_dist checkpoint conversion step
- Switch to conda env activation via PATH
- Remove hardcoded wandb key
- Remove flash-attn build step (pre-built wheel installed)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace hardcoded paths with configurable variables
- Remove SLURM/HPC-specific configuration
- Add usage instructions in header comments
- Add .gitignore for local run scripts (*.local.sh)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@SeungyounShin SeungyounShin changed the title Feature/fipo loss Supporting FIPO Apr 3, 2026
SeungyounShin and others added 2 commits April 3, 2026 05:11
- Assert rollout_log_probs is available (required for multi-step FutureKL)
- Simplify loss masking: pg_losses * final_mask (was needlessly complex)
- Add --use-rollout-logprobs to example scripts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
SeungyounShin and others added 3 commits April 3, 2026 06:31
Replace --fipo-clip-ratio with --fipo-clip-ratio-low and
--fipo-clip-ratio-high to allow asymmetric clipping of influence
weights. This lets ε_high > ε_low, favoring amplification of good
trajectories (FutureKL > 0) over attenuation of bad ones.

Example: --fipo-clip-ratio-low 0.2 --fipo-clip-ratio-high 0.28
gives f_t ∈ [0.8, 1.28], matching PPO's asymmetric eps_clip pattern.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
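The asymmetric bound described in this commit amounts to (illustrative sketch, not the committed code):

```python
import torch

def influence_weight(future_kl, eps_low=0.2, eps_high=0.28):
    """Asymmetric clip of exp(FutureKL): the amplification (upper) bound may
    exceed the attenuation (lower) bound, e.g. f_t in [0.8, 1.28]."""
    return torch.exp(future_kl).clamp(1.0 - eps_low, 1.0 + eps_high).detach()
```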
…nfig

- Add _fipo_log_token_ratios: HTML heatmap of per-token log ratio
  logged to wandb every 8 steps. Green = reinforced, red = suppressed.
  Hover shows exact ratio and f_t values.
- Add Qwen3.5-0.8B model config script
- Update run script to support MODEL_SIZE env var (0.8B/2B)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove _fipo_log_token_ratios (wandb HTML heatmap) from core loss.
Keep only essential FIPO metrics (influence_mean/min/max).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@SeungyounShin SeungyounShin marked this pull request as ready for review April 3, 2026 09:28
@SeungyounShin SeungyounShin changed the title Supporting FIPO Supporting FIPO (Future-KL Influenced Policy Optimization) Apr 3, 2026